ABSTRACT


The Fast Fourier Transform is an algorithm for the computation 
of Discrete Fourier Transforms in less time than allowed by any other 
algorithm available. The use of special purpose digital machines 
to reduce those times even further is of interest for real time 
spectral analysis. The main principles of Fast Fourier Transforms 
are presented. The design of a full-parallel eight sample processor 
is presented as a point of reference for comparison with serial and 
serial-parallel hybrid machines. Carry-Save Addition is introduced
and used as the primary arithmetic logic.
TABLE OF SYMBOLS AND ABBREVIATIONS


Symbol      Definition
Couplet     The combination of two samples in the first pass
            of an FFT. Referred to as a sum couplet or a difference
            couplet, depending on the method of combining.
CSA         Carry-Save Adders or Carry-Save Addition
DFT         Discrete Fourier Transform
F(n)        The n-th coefficient of a DFT
FFT         Fast Fourier Transform
N           The number of samples being processed to obtain
            the DFT
nsec        Nanosecond; 10^-9 seconds
Quartet     The combination of four samples in the second pass
            of an FFT
usec        Microsecond; 10^-6 seconds
Z(k)        The k-th sample of a sampled time signal





1. Introduction
In presenting a procedure that will determine the complex Fourier 
coefficients of a sampled signal, it is first necessary to introduce 


the discrete Fourier transform (DFT) and its inverse. The DFT is 


defined by 
\[ F(n) = \sum_{k=0}^{N-1} Z(k)\exp(-j2\pi nk/N), \qquad n = 0, 1, 2, \ldots, N-1 \tag{1} \]


where F(n) is the n-th coefficient of the DFT and Z(k) the k-th sample
of an N sample signal. It is assumed that the N samples are equally
spaced and taken at a frequency that is at least twice the highest 
frequency component of a bandlimited signal. Using the principles of 


orthogonality it can be shown that the inverse of the DFT is 


\[ Z(k) = \frac{1}{N}\sum_{n=0}^{N-1} F(n)\exp(j2\pi nk/N), \qquad k = 0, 1, 2, \ldots, N-1 \tag{2} \]
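As a minimal sketch of equations (1) and (2), the two sums can be evaluated directly. The function names `dft` and `idft` and the test samples are illustrative assumptions, not part of the original work; the direct sums cost on the order of N^2 operations and serve only as a reference.

```python
import cmath

def dft(z):
    """Direct evaluation of equation (1)."""
    N = len(z)
    return [sum(z[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

def idft(F):
    """Direct evaluation of equation (2), including the 1/N factor."""
    N = len(F)
    return [sum(F[n] * cmath.exp(2j * cmath.pi * n * k / N) for n in range(N)) / N
            for k in range(N)]

# Arbitrary (hypothetical) test samples.
samples = [0.5, 1.0, -0.25, 0.0, 2.0, -1.0, 0.75, 0.125]
recovered = idft(dft(samples))
max_error = max(abs(a - b) for a, b in zip(samples, recovered))
print(max_error)  # only floating-point rounding error remains
```

Applying `idft` to the output of `dft` returns the original samples to within rounding, which is the orthogonality statement behind equation (2).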


Since the procedure uses successive reductions of a finite series' 
length by a factor of two, the number of samples, N, must be an integer 
power, p, of 2 giving $N = 2^p$. It is evident at this point that p
reductions or passes will be necessary to obtain a non-serial solution 
for a given coefficient. 


Partitioning equation (1) at the halfway point, it appears as 


\[ F(n) = \sum_{k=0}^{N/2-1} Z(k)\exp(-j2\pi nk/N) + \sum_{k=N/2}^{N-1} Z(k)\exp(-j2\pi nk/N) \tag{3} \]


substituting k =m + N/2, in the second half of the equation yields 


\[ F(n) = \sum_{k=0}^{N/2-1} Z(k)\exp(-j2\pi nk/N) + \sum_{m=0}^{N/2-1} Z(m+N/2)\exp\left[-j2\pi n(m+N/2)/N\right] \tag{4} \]


and combining under a single summation 


\[ F(n) = \sum_{k=0}^{N/2-1} \left\{ Z(k)\exp(-j2\pi nk/N) + Z(k+N/2)\exp\left[-j2\pi n(k+N/2)/N\right] \right\} \tag{5} \]

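The common exponential multiplier may be factored and, after reduction
of the internal exponential, one obtains

\[ F(n) = \sum_{k=0}^{N/2-1} \left[ Z(k) + Z(k+N/2)\exp(-j\pi n) \right] \exp(-j2\pi nk/N) \tag{6} \]

where

\[ \exp(-j\pi n) = \begin{cases} +1, & n \text{ even} \\ -1, & n \text{ odd} \end{cases} \]

giving

\[ F(n) = \sum_{k=0}^{N/2-1} \left[ Z(k) + Z(k+N/2) \right] \exp(-j2\pi nk/N), \qquad n \text{ even} \tag{7} \]

\[ F(n) = \sum_{k=0}^{N/2-1} \left[ Z(k) - Z(k+N/2) \right] \exp(-j2\pi nk/N), \qquad n \text{ odd} \tag{8} \]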
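Equations (7) and (8) can be checked numerically. The sketch below, with hypothetical helper names and arbitrary test data, forms the sum and difference couplets of the first pass and confirms that the half-length sums reproduce the directly computed coefficients.

```python
import cmath

def dft(z):
    """Reference: direct evaluation of equation (1)."""
    N = len(z)
    return [sum(z[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

def first_pass(z):
    """Sum couplets Z(k)+Z(k+N/2) and difference couplets Z(k)-Z(k+N/2)."""
    half = len(z) // 2
    sums = [z[k] + z[k + half] for k in range(half)]
    diffs = [z[k] - z[k + half] for k in range(half)]
    return sums, diffs

def coefficient_after_first_pass(z, n):
    """Equation (7) for even n, equation (8) for odd n."""
    N = len(z)
    sums, diffs = first_pass(z)
    couplets = sums if n % 2 == 0 else diffs
    return sum(couplets[k] * cmath.exp(-2j * cmath.pi * n * k / N)
               for k in range(N // 2))

z = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0, 2.5]  # arbitrary test samples
reference = dft(z)
worst = max(abs(coefficient_after_first_pass(z, n) - reference[n])
            for n in range(len(z)))
print(worst)  # rounding error only
```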


Initiating a new pass, equations (7) and (8) are partitioned at the
half-range point

\[ F(n) = \sum_{k=0}^{N/4-1} \left[ Z(k) + Z(k+N/2) \right] \exp(-j2\pi nk/N) + \sum_{k=N/4}^{N/2-1} \left[ Z(k) + Z(k+N/2) \right] \exp(-j2\pi nk/N), \qquad n \text{ even} \tag{9} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left[ Z(k) - Z(k+N/2) \right] \exp(-j2\pi nk/N) + \sum_{k=N/4}^{N/2-1} \left[ Z(k) - Z(k+N/2) \right] \exp(-j2\pi nk/N), \qquad n \text{ odd} \tag{10} \]

substituting m = k - N/4, k = m + N/4, in the second summations

\[ F(n) = \sum_{k=0}^{N/4-1} \left[ Z(k) + Z(k+N/2) \right] \exp(-j2\pi nk/N) + \sum_{m=0}^{N/4-1} \left[ Z(m+N/4) + Z(m+3N/4) \right] \exp\left[-j2\pi n(m+N/4)/N\right], \qquad n \text{ even} \tag{11} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left[ Z(k) - Z(k+N/2) \right] \exp(-j2\pi nk/N) + \sum_{m=0}^{N/4-1} \left[ Z(m+N/4) - Z(m+3N/4) \right] \exp\left[-j2\pi n(m+N/4)/N\right], \qquad n \text{ odd} \tag{12} \]


Combining under a single summation and extracting the common exponentials
gives

\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) + Z(k+N/2) \right] + \left[ Z(k+N/4) + Z(k+3N/4) \right] \exp(-j\pi n/2) \right\} \exp(-j2\pi nk/N), \qquad n \text{ even} \tag{13} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) - Z(k+N/2) \right] + \left[ Z(k+N/4) - Z(k+3N/4) \right] \exp(-j\pi n/2) \right\} \exp(-j2\pi nk/N), \qquad n \text{ odd} \tag{14} \]

where

\[ \exp(-j\pi n/2) = \begin{cases} +1, & n/2 \text{ even} \\ -1, & n/2 \text{ odd} \\ -j, & (n-1)/2 \text{ even} \\ +j, & (n-1)/2 \text{ odd} \end{cases} \]

After two passes the coefficients are defined by 


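\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) + Z(k+N/2) \right] + \left[ Z(k+N/4) + Z(k+3N/4) \right] \right\} \exp(-j2\pi nk/N), \qquad n/2 \text{ even} \tag{15} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) + Z(k+N/2) \right] - \left[ Z(k+N/4) + Z(k+3N/4) \right] \right\} \exp(-j2\pi nk/N), \qquad n/2 \text{ odd} \tag{16} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) - Z(k+N/2) \right] - j\left[ Z(k+N/4) - Z(k+3N/4) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ even} \tag{17} \]

\[ F(n) = \sum_{k=0}^{N/4-1} \left\{ \left[ Z(k) - Z(k+N/2) \right] + j\left[ Z(k+N/4) - Z(k+3N/4) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ odd} \tag{18} \]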


General characteristics of each reduction are now apparent. They are:

1. For the u-th pass a complex multiplier of the form $\exp(-j2\pi n/2^u)$
is generated.

2. The limits of the summation after the u-th pass are k=0 to
$k=(N/2^u)-1$.

3. For each reduction beginning with some combination of samples,
T(k), the multiplicand of the complex multiplier is $T(k+N/2^u)$.

4. The general form after the u-th pass is

\[ F(n) = \sum_{k=0}^{N/2^u-1} \left[ T(k) + \exp(-j2\pi n/2^u)\,T(k+N/2^u) \right] \exp(-j2\pi nk/N) \]

5. When u=p, $N=2^p$, the limits of the summation are k=0 to k=0;
the exponential $\exp(-j2\pi nk/N)$ becomes unity and the non-serial solution
of F(n) is explicit.
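The five characteristics above can be sketched as a loop that folds the sample list in half once per pass for a single coefficient. This is only an illustration under stated assumptions: the function name `coefficient_by_reduction` is hypothetical, and evaluating each coefficient separately in this way does not by itself save operations; the saving in the FFT comes from sharing the intermediate combinations T(k) among coefficients.

```python
import cmath

def coefficient_by_reduction(z, n):
    """Compute F(n) by p successive halvings, N = 2**p.

    Pass u uses the multiplier exp(-j*2*pi*n / 2**u) and the sample
    shift N / 2**u; after pass p a single term T(0) = F(n) remains.
    """
    N = len(z)
    if N == 0 or N & (N - 1):
        raise ValueError("N must be an integer power of 2")
    t = list(z)
    u = 0
    while len(t) > 1:
        u += 1
        shift = len(t) // 2                      # N / 2**u
        w = cmath.exp(-2j * cmath.pi * n / 2 ** u)
        t = [t[k] + w * t[k + shift] for k in range(shift)]
    return t[0]

def direct_coefficient(z, n):
    """Reference value from equation (1)."""
    N = len(z)
    return sum(z[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))

z = [0.25, 1.0, -0.5, 2.0, 1.5, -1.0, 0.0, 0.75,
     -2.0, 0.5, 1.25, -0.25, 3.0, -1.5, 0.1, 0.9]   # arbitrary samples, N = 16
worst = max(abs(coefficient_by_reduction(z, n) - direct_coefficient(z, n))
            for n in range(len(z)))
print(worst)  # rounding error only
```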

Now, continuing the procedure for the 3rd pass, u=3:

a. the exponential multiplier is
$\exp(-j2\pi n/2^3) = \exp(-j\pi n/4)$;

b. the limits of summation are k=0 to $k=(N/2^3)-1=(N/8)-1$;

c. the sample shift is $N/2^3 = N/8$;

d. the general form after the third pass is

\[ F(n) = \sum_{k=0}^{N/8-1} \left[ T(k) + \exp(-j\pi n/4)\,T(k+N/8) \right] \exp(-j2\pi nk/N). \]

e. if N=8,

\[ F(n) = T(0) + \exp(-j\pi n/4)\,T(N/8). \]
Recalling equations (15) through (18), the following substitutions
are made to simplify bookkeeping:

\[ A(k) = Z(k) + Z(k+N/2) \qquad B(k) = Z(k+N/4) + Z(k+3N/4) \]

\[ C(k) = Z(k) - Z(k+N/2) \qquad D(k) = Z(k+N/4) - Z(k+3N/4) \]

Also, in any set of equations having common summations, the limits of
the summations will only be shown for the first equation of the set.


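Rewriting (15-18) yields

\[ F(n) = \sum_{k=0}^{N/4-1} \left[ A(k) + B(k) \right] \exp(-j2\pi nk/N), \qquad n/2 \text{ even} \tag{15a} \]

\[ F(n) = \sum \left[ A(k) - B(k) \right] \exp(-j2\pi nk/N), \qquad n/2 \text{ odd} \tag{16a} \]

\[ F(n) = \sum \left[ C(k) - jD(k) \right] \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ even} \tag{17a} \]

\[ F(n) = \sum \left[ C(k) + jD(k) \right] \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ odd} \tag{18a} \]

After the third pass, the resulting equations are

\[ F(n) = \sum_{k=0}^{N/8-1} \left\{ \left[ A(k) + B(k) \right] + \exp(-j\pi n/4)\left[ A(k+N/8) + B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad n/2 \text{ even} \tag{19} \]

\[ F(n) = \sum \left\{ \left[ A(k) - B(k) \right] + \exp(-j\pi n/4)\left[ A(k+N/8) - B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad n/2 \text{ odd} \tag{20} \]

\[ F(n) = \sum \left\{ \left[ C(k) - jD(k) \right] + \exp(-j\pi n/4)\left[ C(k+N/8) - jD(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ even} \tag{21} \]

\[ F(n) = \sum \left\{ \left[ C(k) + jD(k) \right] + \exp(-j\pi n/4)\left[ C(k+N/8) + jD(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-1)/2 \text{ odd} \tag{22} \]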


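noting

\[ \exp(-j\pi n/4) = \begin{cases}
+1, & n/4 \text{ even} \\
-1, & n/4 \text{ odd} \\
+.707(1-j), & (n-1)/4 \text{ even} \\
-.707(1-j), & (n-1)/4 \text{ odd} \\
-j, & (n-2)/4 \text{ even} \\
+j, & (n-2)/4 \text{ odd} \\
-.707(1+j), & (n-3)/4 \text{ even} \\
+.707(1+j), & (n-3)/4 \text{ odd}
\end{cases} \]

Performing the complex multiplication yields

\[ F(n) = \sum_{k=0}^{N/8-1} \left\{ \left[ A(k) + B(k) \right] + \left[ A(k+N/8) + B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad n/4 \text{ even} \tag{23} \]

\[ F(n) = \sum \left\{ \left[ A(k) + B(k) \right] - \left[ A(k+N/8) + B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad n/4 \text{ odd} \tag{24} \]

\[ F(n) = \sum \left( \left\{ C(k) + .707\left[ C(k+N/8) - D(k+N/8) \right] \right\} - j\left\{ D(k) + .707\left[ C(k+N/8) + D(k+N/8) \right] \right\} \right) \exp(-j2\pi nk/N), \qquad (n-1)/4 \text{ even} \tag{25} \]

\[ F(n) = \sum \left( \left\{ C(k) - .707\left[ C(k+N/8) - D(k+N/8) \right] \right\} - j\left\{ D(k) - .707\left[ C(k+N/8) + D(k+N/8) \right] \right\} \right) \exp(-j2\pi nk/N), \qquad (n-1)/4 \text{ odd} \tag{26} \]

\[ F(n) = \sum \left\{ \left[ A(k) - B(k) \right] - j\left[ A(k+N/8) - B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-2)/4 \text{ even} \tag{27} \]

\[ F(n) = \sum \left\{ \left[ A(k) - B(k) \right] + j\left[ A(k+N/8) - B(k+N/8) \right] \right\} \exp(-j2\pi nk/N), \qquad (n-2)/4 \text{ odd} \tag{28} \]

\[ F(n) = \sum \left( \left\{ C(k) - .707\left[ C(k+N/8) - D(k+N/8) \right] \right\} + j\left\{ D(k) - .707\left[ C(k+N/8) + D(k+N/8) \right] \right\} \right) \exp(-j2\pi nk/N), \qquad (n-3)/4 \text{ even} \tag{29} \]

\[ F(n) = \sum \left( \left\{ C(k) + .707\left[ C(k+N/8) - D(k+N/8) \right] \right\} + j\left\{ D(k) + .707\left[ C(k+N/8) + D(k+N/8) \right] \right\} \right) \exp(-j2\pi nk/N), \qquad (n-3)/4 \text{ odd} \tag{30} \]
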
Substitution for the functions A,B,C, and D will produce explicit statements 
of the transform equations after three passes. 

Further arithmetic development is avoided for brevity with the state- 
ment of the transform equation sets for the 8 sample (N=8) and 16 sample 
(N=16) cases in Appendices I and II, respectively. Appendix III, in turn, 
is an illustrative problem in which the spectral lines of a signal are 
determined and subsequently the originating signal reproduced via the 


described transform and its inverse. 
The procedure just presented is indeed a very simple example
of the Fast Fourier Transform (FFT), an algorithm that allows the 
computation of a time series DFT with fewer actual arithmetic 
operations than other algorithms available. An arithmetic operation 
is defined, here, as a complex multiplication followed by a complex 
addition. 

Noting that the straightforward method of computing the DFT
requires $N^2$ operations, Cooley and Tukey [1] showed that fewer than
$2N\log_2 N$ operations are required when the FFT algorithm is used.
They also showed that this algorithm requires no more data storage 
than the storage required for the initial samples, assuming the 
initial samples are complex. The particular method demonstrated 
herein, however, was first shown by Sande [2] and later referred to
as "Decimation in Frequency" by Cochran, et al, [3] because of the 
characteristic divisions of the F(n)'s after each pass. 
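To give a feel for the two operation counts quoted above, the following sketch tabulates N^2 against the bound 2N log2 N for the sample sizes discussed in this work. The function names are illustrative assumptions; the figures are upper bounds on operation counts, not measured times.

```python
import math

def direct_operations(N):
    """Operations for the straightforward DFT: N**2."""
    return N * N

def fft_operation_bound(N):
    """Upper bound on FFT operations: 2 * N * log2(N)."""
    return 2 * N * math.log2(N)

for p in range(3, 11):          # N = 8, 16, ..., 1024
    N = 2 ** p
    print(N, direct_operations(N), int(fft_operation_bound(N)))
```

For N = 1024 the direct count is 1,048,576 operations against a bound of 20,480.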

In Fig. 1 a flow diagram for the equations of Appendix I, the 
eight sample example, is shown. To be sure, this diagram is not 
entirely general; the assumption of real inputs avoids continual 
complex additions and allows absolute segregation of the real and 
imaginary parts of the computed coefficients. Such an assumption 
has many real world applications and succeeds in reducing the 
computation time and circuitry by at least a factor of two. 

A flow diagram for the computation of the even numbered coef- 
ficients of the 16 sample example (equations of Appendix II) is 
presented in Fig. 2. It is interesting to note that this is, in
fact, the same data flow as shown in Fig. 1, with the exception
of the first addition cycle. If the first subscript of each basic
couplet in the first pass column in Fig. 2 were used to define the
couplet, the last statement becomes even more obvious. Figure 3 
completes the flow diagram for the 16 sample example by presenting 
the flow for the odd coefficients. Notably, this type of flow 
does not take full advantage of the FFT properties. 

The pre-multiplication of difference couplets by appropriate 
exponential factors and then continuing through the next lower order 
process was shown graphically in [3] for the eight sample case. 

Figure 4 extends that representation to the 16 sample case and 
illustrates the ease with which general purpose computers might 

handle such computations. Comparing Figures 3 and 4, the inherent 
trade-off between special purpose and general purpose implementations 
of a particular function is illustrated. The use of recursive procedures
and the availability of complex arithmetic make general purpose
computation of the Fig. 4 process most advantageous. The
elimination of complex multiplications and complex input data in the
Fig. 3 process, by contrast, produces a more cumbersome set of equations,
which are, however, more easily adapted to special purpose computation.

It is the purpose of this work to investigate the applicability
of special purpose computers to Fast Fourier Transforms. The main
criterion of comparison between special purpose and general purpose 
computations will be time versus circuit complexity, where complexity 
includes the number of components, the size of memory, and the size 
of the control unit. 

A completely parallel, 8 "real" sample, 8 spectral line processor
will be used as a vehicle for the "fast" special purpose machine.
Extrapolations to greater numbers of sampled inputs and the subsequent
increase in spectral lines will be made with respect to the increased
complexity and cost. Further, an investigation of hybrid sequential-parallel
machines will be presented.
2. Parallel Processing


The data flow diagram of the eight sample example is shown in
Fig. 1 and the explicit set of equations is given in Appendix I.

As the word "parallel" has many connotations when referring 
to computers, "full parallel" computation implies the simultaneous 
computation of all equations of the set. This does not, however, 
mean that the computation of each equation in all passes of the 
FFT must be independent. In fact, independent computation, when 
applied to the particular set of equations being examined, would be
ridiculous. The computation in each pass is done in parallel. 

Examining Fig. 1, six basic combinations of four samples (quartets)
and two basic couplets are found at the end of the second pass. Focusing
attention on the quartets for the present, the principle of
Carry-Save Adders (CSA) is utilized. As shown in Appendix IV, CSA
techniques may be used to obtain exceptionally fast addition of
several arguments through logic rather than cyclic operations as in
conventional adders. It might be added that CSA may also be used to
implement extremely fast multiplication processes. Six 4-argument
CSA's will be used at the front of the proposed processor.

At this point, the real parts of F(2) and F(6) and the imaginary 
part of F(6) are available. The addition of one more rank of CSA's 
to the imaginary F(6) processor provides the imaginary part of F(2), 
the negative of the imaginary part of F(6). F(0) and F(4) are obtained 
by crossfeeding the results of the all even samples and all odd samples 
summers into two additional two-argument CSA's. The odd-sum must be
complemented in one case, producing F(4); the non-complemented case
yields F(0).
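The relationships used in this paragraph follow from the real-input assumption and can be illustrated numerically. The sketch below uses arbitrary real samples and a hypothetical reference `dft` function; it checks that F(0) and F(4) are the sum and difference of the all-even and all-odd sample sums, and that F(2) and F(6) share a real part while their imaginary parts are negatives of each other.

```python
import cmath

def dft(z):
    """Reference: direct evaluation of equation (1)."""
    N = len(z)
    return [sum(z[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

z = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.9]   # arbitrary real samples
F = dft(z)

even_sum = z[0] + z[2] + z[4] + z[6]   # all even samples summer
odd_sum = z[1] + z[3] + z[5] + z[7]    # all odd samples summer
f0 = even_sum + odd_sum                # non-complemented case
f4 = even_sum - odd_sum                # odd-sum complemented

print(abs(F[0] - f0), abs(F[4] - f4))
print(F[2].imag + F[6].imag)           # imaginary parts cancel
```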


Figure 5. The block diagram of the full-parallel eight sample processor. The numerical
signs in parentheses indicate the action within the CSA units. The absence of
parentheses indicates a positive input.



To obtain the odd coefficients, the results of the two remaining
quartet CSA's must be multiplied by the constant .70711. Multiplication
is a shift and add process, dependent on the position of
"ones" in the binary multiplier. Since .70711 (decimal) = .1011010100 (binary), the
shifts would be to the right and five additions necessitated.

For example, the multiplication of an eight bit quantity M by
.70711 would be represented as:

    weight       7   6   5   4   3   2   1   0  -1  -2  -3  -4  -5  -6  -7  -8

    M x 2^-1        M7  M6  M5  M4  M3  M2  M1  M0
    M x 2^-3                M7  M6  M5  M4  M3  M2  M1  M0
    M x 2^-4                    M7  M6  M5  M4  M3  M2  M1  M0
    M x 2^-6                            M7  M6  M5  M4  M3  M2  M1  M0
    M x 2^-8                                    M7  M6  M5  M4  M3  M2  M1  M0

    product     P7  P6  P5  P4  P3  P2  P1  P0 P-1 P-2 P-3 P-4 P-5 P-6 P-7 P-8

A CSA multiplier for this example would propagate the carries from the
-6 digit forward and require a maximum of five arguments (the P-1 bit).
The maximum number of delays for multiplication by the given constant
would then be 5+k for a k bit multiplicand.
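The shift and add process can be sketched in a few lines. The helper names below are hypothetical, and the sketch keeps eight fractional bits so that no shifted bit is discarded, which is an assumption about word handling rather than a statement about the actual hardware.

```python
# Positions of the "ones" in .1011010100 (binary): 2^-1, 2^-3, 2^-4, 2^-6, 2^-8.
SHIFTS = (1, 3, 4, 6, 8)

def constant_value():
    """Value represented by the ten-bit binary constant."""
    return sum(2.0 ** -s for s in SHIFTS)

def multiply_by_constant(m, fraction_bits=8):
    """Multiply integer m by the constant using right shifts and additions.

    m is first scaled up by 2**fraction_bits so the bits shifted to the
    right of the binary point are retained.
    """
    scaled = m << fraction_bits
    total = sum(scaled >> s for s in SHIFTS)    # five partial products
    return total / 2 ** fraction_bits

def column_heights(bits=8):
    """Number of partial-product bits landing on each bit weight."""
    heights = {}
    for s in SHIFTS:
        for i in range(bits):
            heights[i - s] = heights.get(i - s, 0) + 1
    return heights

print(constant_value())             # 0.70703125
print(multiply_by_constant(200))    # 141.40625
print(column_heights()[-1])         # 5 arguments meet at weight -1
```

The column count reproduces the observation above: only the weight -1 column receives five arguments, and two or more arguments first meet at weight -6.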

Performing the multiplication in two Carry-Save Adder (CSA) 
multipliers then allows the crossfeeding directly into four 3-argument 
CSA adders. This gives the respective combination for the real parts 
of the odd coefficients and the imaginary parts of F(3) and F(7). 
Complementing the imaginary parts of F(3) and F(7) by an additional
CSA rank on each of the sums yields the imaginary parts of F(1) and
F(5).

Having outlined the complete processor verbally, the block diagram
is presented in Fig. 5. Note that the computation of the imaginary parts
of F(1) and F(5) is the most involved; therefore, the time analysis is
based on those computations. For the sake of real time estimation,
a 12-bit word and the availability of the 20 nsec adders mentioned in
Appendix IV are assumed. The number of delays to the final stage is
taken from Appendix IV and 11 ripple delays are assumed for the 12-bit
word.

The initial four argument processor requires three delays, the
multiplication four delays, the final three argument addition two
delays, and the complement one delay for a total of 10 delays to the
final stage. Adding 11 ripple delays gives 21 units of delay for a
420 nsec total time of computation for the eight spectral lines of
a signal.
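The delay budget just described amounts to a short sum. The sketch below simply restates the figures given in the text; the dictionary keys are descriptive labels chosen for illustration.

```python
ADDER_DELAY_NSEC = 20   # assumed full adder delay, as stated in Appendix IV

# Delays along the longest path (imaginary parts of F(1) and F(5)).
stage_delays = {
    "four-argument CSA": 3,
    "constant multiplication": 4,
    "three-argument CSA": 2,
    "complement": 1,
}
ripple_delays = 11      # assumed for the 12-bit word

total_delays = sum(stage_delays.values()) + ripple_delays
total_nsec = total_delays * ADDER_DELAY_NSEC
print(total_delays, total_nsec)   # 21 420
```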

A maximum of k - 1 full adders are required for each bit of a 
k argument CSA. A measure of complexity, the total number of adders,
may be calculated. Thirty-six adders are required for each of the
six, 4-argument, 12-bit units for a total of 246 adders for all six 
units. For each of the two argument summations and the three negation 
operations, 12 adders are required for a total of 60. For each of the 
four 3-argument units following the multipliers, 24 adders are required 
for a total of 96. Finally, for each multiplier 45 adders are needed 
giving a 90 adder total. The processor then requires a grand total 
of 492 adders. A completely parallel computation of the eight Fourier 
coefficients takes less than .5 usec. using five hundred 20 nsec 
full adders. 
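The per-unit figures above follow from the rule of k - 1 full adders per bit. A minimal sketch of that rule, with a hypothetical function name, reproduces the unit sizes for the 12-bit word; the 45 adders per multiplier are taken from the text and not derived here.

```python
def csa_adders(arguments, word_bits):
    """Upper bound on full adders in a k-argument CSA: (k - 1) per bit."""
    return (arguments - 1) * word_bits

WORD_BITS = 12
four_argument_unit = csa_adders(4, WORD_BITS)    # quartet units
three_argument_unit = csa_adders(3, WORD_BITS)   # units after the multipliers
two_argument_unit = csa_adders(2, WORD_BITS)     # summations and negations

print(four_argument_unit, three_argument_unit, two_argument_unit)  # 36 24 12
print(5 * two_argument_unit, 4 * three_argument_unit, 2 * 45)      # 60 96 90
```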

Since the initial information may be stored in the same location 
as the eventual answers, 192 bits of immediate access memory are needed.
No control units other than a .5 usec clock and a memory read and write 
logic are required. If J-K flip-flop memory is used, only one AND 


gate is needed for control of the memory. 
Extension of the methods of computation shown for the eight
sample example to a sixteen sample example may be accomplished with
one basic assumption: an average of five additions is required for
each constant multiplier.

Referring to Fig. 2, the simplest method of obtaining the even 
coefficients would be to provide eight parallel adders in front of 
an eight sample processor. For the specified 12-bit word, this 
requires an added 96 adders to the basic 492 and would add 20 nsec 
to the computation time. 

Computation of the odd coefficients becomes slightly involved. 
It is found that 1076 adders are required and an approximate compu- 
tation time of 500 nsecs achieved (based on the multiplication 
assumption stated above). 

Thus a 16-sample processor would require a total of 1664 adders 
and would produce the 16 complex spectral lines in 500 nsec. It is 
quite evident that extension of the "full parallel" computation to
higher numbers of inputs becomes extremely expensive in terms of the
number of adders required. However, the computation times are 
extremely fast. It is estimated that almost two million adders would 
be required for the 1024-sample case. The computations would still 
be accomplished in less than 2 usec. 

It must be remembered that the figures given in this section 
are for real input values, not complex. To further extend the 
processors to handle complex input signals would require an approximate 
25% increase in adders and a 10% increase in processing time. Thus, 
the eight and sixteen sample processors would require approximately 
600 and 2000 adders respectively for the computation of coefficients 


iene IISc . 
3. Hybrid Processing


A FORTRAN IV program used to compute the Fast Fourier Transforms 
on an IBM-360 system was used to give some indication of a fast serial 
computation of the coefficients for various numbers of samples. The
average time to perform an arithmetic operation (complex addition 
followed by complex multiplication) was estimated at 22 usec. From 
that average operation time, the maximum computation times for several 
different numbers of samples were computed and are shown in Fig. 6. 

The question of combining the speed of parallel processing with 
the simplicity of serial operation must be answered with a compromise. 
It is quite evident that for 16 samples or more, straight parallel 
computation is unreasonably expensive. With a very low hypothetical 
price of $10.00 per adder, a 16 sample processor's arithmetic unit 
alone would cost $20,000.00. Consider, however, a unit which would 
compute 16 coefficients according to the flow in Fig. 4 by using a 
single full-parallel eight complex sample processor and serially 
processing the basic couplet's arithmetic and complex multiplications. 

A general 12 x 12 bit CSA multiplier that will give the product 
in 360 nsec may be constructed with 132 adders and 156 AND gates. 

With four such multipliers and two simple adders a complex multiplier 
is constructed that will provide the product of two complex quantities 
in 380 nsec. Since the basic couplets characteristically fall into 

sum and difference couplets (see the first pass column in Fig. 2) it 
seems reasonable to compute both at the same time. By adding a complex 
subtractor (two words) before the multiplier and a parallel adder, the 
sum couplet and the difference couplet with complex multiplication are 


computed in 400 nsec. Such a unit would require approximately 575
adders and 624 AND gates. If 9 AND gates are considered equivalent
to one full adder, the above circuit is equivalent to 640 adders.
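The arithmetic performed by this unit can be sketched as follows. The function names are hypothetical; the complex product is written out with four real multiplications and two real additions, matching the four-multiplier arrangement described above, and the last line restates the adder equivalence using the quoted figures.

```python
def complex_multiply(a, b, c, d):
    """(a + jb)(c + jd) with four real multiplications and two additions."""
    return a * c - b * d, a * d + b * c

def couplet_unit(x, y, w):
    """Return the sum couplet x + y and the difference couplet (x - y) * w."""
    total = x + y
    diff = x - y
    re, im = complex_multiply(diff.real, diff.imag, w.real, w.imag)
    return total, complex(re, im)

x, y, w = 1 + 2j, 3 - 1j, 0.5 - 0.5j     # arbitrary test values
s, d = couplet_unit(x, y, w)
print(s, d)                               # (4+1j) (0.5+2.5j)

# 575 adders plus 624 AND gates at 9 gates per adder: roughly 644,
# in line with the rounded figure of 640 quoted in the text.
equivalent_adders = 575 + 624 / 9
print(round(equivalent_adders))
```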

At this point the "hybrid" processor consists of one eight-complex-sample
processor and one arithmetic unit that computes
the sum couplet, the difference couplet, and multiplies the difference
couplet by a complex constant in a single 400 nsec operation.

For the 16 sample problem of Fig. 4, the arithmetic unit would 
only be required to cycle through eight calculations (First Pass), 
then the eight sample unit would cycle twice. Assuming a .25 
usec read and write time, one complete arithmetic operation could be 
completed in less than 1.25 usec. The eight arithmetic operations 
would require 10 usec and the two cycles of the eight sample unit 
9 usec (4.5 usec per cycle) for a 19 usec computation time. The 
above calculation times have been extended for various numbers of 
sampled inputs and the results are shown in Fig. 6. 

The hybrid processor's arithmetic unit at this point consists of
a 1240 adder unit. However, the processor must have a control unit 
and a control memory. The control memory requires a maximum of N/2 
complex multipliers plus the control program. The control unit must 
be able to index the pass numbers, the arithmetic operations, and the 
eight sample process pass number. Assuming a single word in memory 
is equivalent to a full adder in cost and complexity, a possible basis 
of comparison is achieved. Since two words of memory are needed for 
each multiplier there will be N words or, equivalently, N adders 
required in the control memory of an N sample unit. An assumption of
a 200 adder equivalent control unit, including the control program
memory, is a reasonable estimate. For the range of sample inputs from
8 to 1024 this value should remain relatively constant. In general,
the control unit then requires the equivalent of N + 200 adders.

The 16 sample hybrid processor is the equivalent of 1450 full 
adders. Figure 7 illustrates the relative complexities of the hybrid, 
the parallel, and the serial processor. The basic requirements for
sample input and coefficient output memory are the same for all three
processes and are not included in the complexity figures.
4. Conclusions

Figure 6 graphically shows the superiority of the parallel 
processor with respect to speed of computation while Fig. 7 demon- 
strates the enormity of size and complexity required to obtain those 
speeds. Conversely, Fig. 7 shows the serial processor to have the
complexity advantage while being extremely slow in computation, as
shown in Fig. 6.

The combining of serial and parallel operations succeeded in 
incorporating the better qualities of each. Certainly, the com- 
plexity of the hybrid system follows the same trend as the serial 
processor in that it is relatively flat in the log-log plot. This 
was to be expected. In presenting the hybrid system, the major 
nature of the system is serial. The parallel characteristics are 
found in the handling of several computations at one time and reducing 
substantially the arithmetic operation time by batch processing 
through the eight sample full-parallel processor in the final stages.
On the log-log plot of Fig. 6 the slope of the hybrid curve is slightly 
greater than that of the serial curve. However, at the last point 
presented (1024 samples) the hybrid processor is still 500 times 
faster than the serial unit. 

It is evident from the computation times shown that definite 
applications for the special purpose handling of FFT's do exist. In 
fact, for many applications, real-time analysis is possible. A central 
problem remains, that of analog to digital conversion of the data in
times that will be able to fully utilize the speed of the processors.
APPENDIX I


EQUATIONS FOR EIGHT SAMPLE APPLICATION
The following set of equations defines the DFT for N=8. The
equations are expressed in terms of the sample number (sample subscript)
rather than using the subscripted notation of the Introduction. In
this notation the couplet [Z(0)+Z(4)] would appear as (0+4), and
non-integer numbers are multiplicative constants.


F(0) = [(0+4)+(2+6)] + [(1+5)+(3+7)]
F(1) = {(0-4)+.707[(1-5)-(3-7)]} - j{(2-6)+.707[(1-5)+(3-7)]}
F(2) = [(0+4)-(2+6)] - j[(1+5)-(3+7)]
F(3) = {(0-4)-.707[(1-5)-(3-7)]} + j{(2-6)-.707[(1-5)+(3-7)]}
F(4) = [(0+4)+(2+6)] - [(1+5)+(3+7)]
F(5) = {(0-4)-.707[(1-5)-(3-7)]} - j{(2-6)-.707[(1-5)+(3-7)]}
F(6) = [(0+4)-(2+6)] + j[(1+5)-(3+7)]
F(7) = {(0-4)+.707[(1-5)-(3-7)]} + j{(2-6)+.707[(1-5)+(3-7)]}
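The eight equations above can be transcribed directly and compared with the defining sum. In the sketch below the function names are hypothetical, the inputs are assumed real as in the main text, and the constant .707 is taken as the square root of one half so that the comparison holds to rounding error.

```python
import cmath
import math

R = math.sqrt(0.5)   # the constant written as .707 in the equations

def fft8_real(z):
    """Eight-sample transform of real samples, following Appendix I."""
    s04, d04 = z[0] + z[4], z[0] - z[4]
    s26, d26 = z[2] + z[6], z[2] - z[6]
    s15, d15 = z[1] + z[5], z[1] - z[5]
    s37, d37 = z[3] + z[7], z[3] - z[7]
    p = R * (d15 - d37)          # .707[(1-5)-(3-7)]
    q = R * (d15 + d37)          # .707[(1-5)+(3-7)]
    return [
        complex((s04 + s26) + (s15 + s37), 0.0),   # F(0)
        complex(d04 + p, -(d26 + q)),              # F(1)
        complex(s04 - s26, -(s15 - s37)),          # F(2)
        complex(d04 - p, d26 - q),                 # F(3)
        complex((s04 + s26) - (s15 + s37), 0.0),   # F(4)
        complex(d04 - p, -(d26 - q)),              # F(5)
        complex(s04 - s26, s15 - s37),             # F(6)
        complex(d04 + p, d26 + q),                 # F(7)
    ]

def dft(z):
    """Reference: direct evaluation of equation (1)."""
    N = len(z)
    return [sum(z[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

z = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.9]   # arbitrary real samples
worst = max(abs(a - b) for a, b in zip(fft8_real(z), dft(z)))
print(worst)  # rounding error only
```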
APPENDIX II


EQUATIONS FOR SIXTEEN SAMPLE APPLICATION
The set of equations beginning on the following page defines
the DFT for N=16. As in Appendix I, the equations are expressed
in terms of the sample number.
APPENDIX III


ILLUSTRATIVE EXAMPLE OF THE DISCRETE
FOURIER TRANSFORM AND ITS INVERSE
A time signal, f(t), is sampled eight times at equally spaced
intervals. For convenience, the signal is a sinewave and the
interval $\pi/4$ is chosen.

The sampled signal appears as

[Figure: plot of the eight sample values Z(k) against k]

Z(0)=Z(2)=.707     Z(1)=1
Z(4)=Z(6)=-.707    Z(5)=-1
Z(3)=Z(7)=0


Forming the basic couplets of the equations in Appendix I:


Z(0)+Z(4)=0 Z(0)-Z(4)=1.414 
Z(2)+Z(6)=0 Z(2)-Z(6)=1.414 
Z(1)+Z(5)=0 Z(1)-Z(5)=2 
Z(3)+Z(7)=0 Z(3)-Z(7)=0 
Now, solving for the coefficients


F(0)=(0+0)+(0+0) = 0
F(1)=[1.414+.707(2-0)]-j[1.414+.707(2+0)] = 2.828(1-j)
F(2)=(0-0)-j(0-0) = 0
F(3)=[1.414-.707(2-0)]+j[1.414-.707(2+0)] = 0+j0 = 0
F(4)=(0+0)-(0+0) = 0
F(5)=[1.414-.707(2-0)]-j[1.414-.707(2+0)] = 0-j0 = 0
F(6)=(0-0)+j(0-0) = 0
F(7)=[1.414+.707(2-0)]+j[1.414+.707(2+0)] = 2.828(1+j)

If the foregoing procedure is correct, application of the inverse 
transform should produce the original sample values. Noting that the only 
non-zero coefficients are F(1) and F(7), the inverse is written 

\[ Z(k) = \frac{1}{8}\left[ F(1)\exp(j\pi k/4) + F(7)\exp(-j\pi k/4) \right] \]

and the Z(k)'s are

Z(0)=(2.828/8)[(1-j)+(1+j)] = .707
Z(1)=(2.828/8)[(1-j)(.707)(1+j)+(1+j)(.707)(1-j)] = 1
Z(2)=(2.828/8)[(1-j)(j)+(1+j)(-j)] = .707
Z(3)=(2.828/8)[(-.707)(1-j)^2+(-.707)(1+j)^2] = 0
Z(4)=(2.828/8)[(1-j)(-1)+(1+j)(-1)] = -.707
Z(5)=(2.828/8)[(-.707)(1+j)(1-j)+(-.707)(1-j)(1+j)] = -1
Z(6)=(2.828/8)[(1-j)(-j)+(1+j)(j)] = -.707
Z(7)=(2.828/8)[(.707)(1-j)^2+(.707)(1+j)^2] = 0

The above Z(k) match precisely the original data, thus showing the
correctness of the development.
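The example can be repeated numerically. The sketch below assumes the samples are sin(pi(k+1)/4), which coincides with the eight listed values; that closed form and the variable names are illustrative assumptions rather than something stated in the text.

```python
import cmath
import math

N = 8
# Assumed closed form for the listed samples .707, 1, .707, 0, -.707, -1, -.707, 0.
z = [math.sin(math.pi * (k + 1) / 4) for k in range(N)]

# Forward transform, equation (1).
F = [sum(z[k] * cmath.exp(-2j * math.pi * n * k / N) for k in range(N))
     for n in range(N)]
nonzero = [n for n in range(N) if abs(F[n]) > 1e-9]
print(nonzero)        # only F(1) and F(7) survive
print(F[1], F[7])     # approximately 2.828(1-j) and 2.828(1+j)

# Inverse using only the two non-zero coefficients.
rebuilt = [((F[1] * cmath.exp(1j * math.pi * k / 4)
             + F[7] * cmath.exp(-1j * math.pi * k / 4)) / N).real
           for k in range(N)]
worst = max(abs(a - b) for a, b in zip(z, rebuilt))
print(worst)          # rounding error only
```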




APPENDIX IV 
CARRY-SAVE ADDERS 

Basically a full adder has three inputs (two addends and a
carry-in), and two outputs (sum and carry-out). Thus, it may be
looked upon as a reduction process from three to two arguments.
Considering the "carry-in" as a third addend, it is apparent that
the addition of more than two quantities may be handled by cascading
the aforementioned reduction process. Such implementation of full
adders has led to both the process and the adders themselves being
called Carry-Save Adders (CSA).

If more than three addends are being summed, more than one
initial CSA would be required. The figure below schematically shows
the reduction process for a six argument addition.

[Figure: Schematic representation of a six argument Carry-Save Addition]


Note that this is strictly a representation of the reduction process. 
In the actual process the "carry-out" of each adder must go to the 
next higher significant bit train. Its position, as shown in the 
diagram, would be taken by the carries from the next lower significant 
bit train. It is noted here that the CSA process generally requires
k-1 units for the reduction of k addends. The final stage (level) of
the CSA process is of interest since the propagation of the carries
is of a "ripple" nature.
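The reduction process can be imitated on whole words with bitwise operations. This is a software sketch under stated assumptions: the function names are hypothetical, the operands are non-negative integers, and the final ordinary addition stands in for the ripple stage.

```python
def carry_save(a, b, c):
    """One rank of full adders: three addends in, sum word and carry word out."""
    partial_sum = a ^ b ^ c
    # Each carry-out moves to the next higher significant bit.
    carries = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carries

def csa_add(addends):
    """Reduce k >= 2 addends to two words, then add them conventionally.

    Returns the total and the number of units used: k - 2 carry-save
    ranks plus the final adder, i.e. k - 1 units.
    """
    words = list(addends)
    ranks = 0
    while len(words) > 2:
        s, c = carry_save(words[0], words[1], words[2])
        words = words[3:] + [s, c]
        ranks += 1
    return sum(words), ranks + 1

total, units = csa_add([3, 5, 7, 9, 11, 13])   # six argument addition
print(total, units)   # 48 5
```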
The following diagram presents a complete four-bit adder
for four addends (A, B, C, and D).

[Figure: Block diagram of a four-bit, four addend,
Carry-Save Adder (an addition of four 4-bit words)]

It is obvious that the CSA process
is strictly a logic function. With the assumption of a realistic
signal delay time through each logic level, the total computation 

time may be calculated. Although pressing the state of the art 

at this time, an assumption of a 20 nsec full adder is not unrealistic. 
Thus, for the given adder, three levels of delay plus the final carries' 
ripple through two delay levels gives a total of five delays, or a 100 
nsec addition of four, 4-bit arguments. 

With increasing word lengths the computation time increases 
linearly for the CSA process. This characteristic is the result of 
adding one more stage of carry ripple for each additional bit. For 
a 12-bit version of the four argument adder, there would be 13 units
of delay and an addition time of 260 nsec.

The use of carry look ahead techniques can reduce the ripple
time by a factor of two per stage of delay. For the 12-bit example, 
the three initial units of delay are summed with 10 half-units for 
8 delays and a 160 nsec addition of four, 12-bit, quantities is 
possible. 
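The delay figures quoted for the four argument adder fit a simple linear model. The sketch below assumes three carry-save levels plus a ripple of (word length - 2) levels, a rule inferred here from the 4-bit and 12-bit figures in the text and not stated by it; the function name is hypothetical.

```python
ADDER_DELAY_NSEC = 20

def four_argument_add_time(word_bits, look_ahead=False):
    """Addition time in nsec for four addends of the given word length."""
    csa_levels = 3
    ripple_levels = word_bits - 2        # assumption inferred from the text
    if look_ahead:
        ripple_levels = ripple_levels / 2    # ripple time halved
    return (csa_levels + ripple_levels) * ADDER_DELAY_NSEC

print(four_argument_add_time(4))                     # 100
print(four_argument_add_time(12))                    # 260
print(four_argument_add_time(12, look_ahead=True))   # 160.0
```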

Finally, two's complement arithmetic is appropriate since it is
easily obtained by using the ones' complement and just adding one
in the least significant bit train.

The following table gives the number of delays as a function of 


the number of arguments in a CSA process. 


Number of Arguments Number of Delays 
2 6 » 4 &. Se. See 